EFFICIENT IMPLEMENTATION OF JOINT OPTIMIZATION 
OF EXCITATION AND MODEL PARAMETERS IN 
MULTIPULSE SPEECH CODERS 



BACKGROUND 

The present invention relates generally to speech encoding, and more 
particularly, to an efficient encoder that employs sparse excitation pulses. 

Speech compression is a well known technology for encoding speech 
into digital data for transmission to a receiver which then reproduces the 
speech. The digitally encoded speech data can also be stored in a variety of 
digital media between encoding and later decoding (i.e., reproduction) of the 
speech. 

Speech coding systems differ from other analog and digital encoding 
systems that directly sample an acoustic sound at high bit rates and transmit 
the raw sampled data to the receiver. Direct sampling systems usually 
produce a high quality reproduction of the original acoustic sound and are 
typically preferred when quality reproduction is especially important. Common 
examples where direct sampling systems are usually used include music 
phonographs and cassette tapes (analog) and music compact discs and 
DVDs (digital). One disadvantage of direct sampling systems, however, is the 
large bandwidth required for transmission of the data and the large memory 
required for storage of the data. Thus, for example, in a typical encoding 
system which transmits raw speech data sampled from an original acoustic 
sound, a data rate as high as 128,000 bits per second is often required. 

In contrast, speech coding systems use a mathematical model of 
human speech production. The fundamental techniques of speech modeling 
are known in the art and are described in B.S. Atal and Suzanne L. Hanauer, 
Speech Analysis and Synthesis by Linear Prediction of the Speech Wave, 
The Journal of the Acoustical Society of America, 637-55 (vol. 50 1971). The 
model of human speech production used in speech coding systems is usually 
referred to as the source-filter model. Generally, this model includes an 
excitation signal that represents airflow produced by the vocal folds, and a 
synthesis filter that represents the vocal tract (i.e., the glottis, mouth, tongue, 
nasal cavities and lips). Therefore, the excitation signal acts as an input 
signal to the synthesis filter similar to the way the vocal folds produce air flow 
to the vocal tract. The synthesis filter then alters the excitation signal to 
represent the way the vocal tract manipulates the air flow from the vocal folds. 
Thus, the resulting synthesized speech signal becomes an approximate 
representation of the original speech. 

One advantage of speech coding systems is that the bandwidth 
needed to transmit a digitized form of the original speech can be greatly 
reduced compared to direct sampling systems. Thus, by comparison, 
whereas direct sampling systems transmit raw acoustic data to describe the 
original sound, speech coding systems transmit only a limited amount of 
control data needed to recreate the mathematical speech model. As a result, 
a typical speech synthesis system can reduce the bandwidth needed to 
transmit speech to between about 2,400 and 8,000 bits per second. 

One problem with speech coding systems, however, is that the quality 
of the reproduced speech is sometimes relatively poor compared to direct 
sampling systems. Most speech coding systems provide sufficient quality for 
the receiver to accurately perceive the content of the original speech. 
However, in some speech coding systems, the reproduced speech is not 
transparent. That is, while the receiver can understand the words originally 
spoken, the quality of the speech may be poor or annoying. Thus, a speech 
coding system that provides a more accurate speech production model is 
desirable. 

One solution that has been recognized for improving the quality of 
speech coding systems is described in U.S. Patent Application 09/800,071 to 
Lashkari et al., hereby incorporated by reference. Briefly stated, this solution 
involves minimizing a synthesis error between an original speech sample and 
a synthesized speech sample. One difficulty that was discovered in that 
speech coding system, however, is the highly nonlinear nature of the 
synthesis error, which made the problem mathematically ill-behaved. This 
difficulty was overcome by solving the problem using the roots of the 
synthesis filter polynomial instead of coefficients of the polynomial. 
Accordingly, a root optimization algorithm is described therein for finding the 
roots of the synthesis filter polynomial. 

One improvement upon the above-mentioned solution is described in 
U.S. Patent Application to Lashkari et al. (Attorney Docket No. 
10745/20). This improvement describes an improved gradient search 
algorithm that may be used with iterative root searching algorithms. Briefly 
stated, the improved gradient search algorithm recalculates the gradient 
vector at each iteration of the optimization algorithm to take into account the 
variations of the decomposition coefficients with respect to the roots. Thus, 
the improved gradient search algorithm provides a better set of roots 
compared to algorithms that assume the decomposition coefficients are 
constant during successive iterations. 

One remaining problem with the optimization algorithm, however, is the 
large amount of computational power that is required to encode the original 
speech. As those in the art well know, a central processing unit ("CPU") or a 
digital signal processor ("DSP") must be used by speech coding systems to 
calculate the various mathematical formulas used to code the original speech. 
Oftentimes, when speech coding is performed by a mobile unit, such as a 
mobile phone, the CPU or DSP is powered by an onboard battery. Thus, the 
computational capacity available for encoding speech is usually limited by the 
speed of the CPU or DSP or the capacity of the battery. Although this 
problem is common in all speech coding systems, it is especially significant in 
systems that use optimization algorithms. Typically, optimization algorithms 
provide higher quality speech by including extra mathematical computations in 
addition to the standard encoding algorithms. However, inefficient 
optimization algorithms require more expensive, heavier and larger CPUs and 
DSPs which have greater computational capacity. Inefficient optimization 
algorithms also use more battery power, which results in shortened battery 
life. Therefore, an efficient optimization algorithm is desired for speech coding 
systems. 



BRIEF SUMMARY 

Accordingly, an efficient speech coding system is provided for 
optimizing the mathematical model of human speech production. The efficient 
encoder includes an improved optimization algorithm that takes into account 
the sparse nature of the multipulse excitation by performing the computations 
for the gradient vector only where the excitation pulses are non-zero. As a 
result, the improved algorithm significantly reduces the number of calculations 
required to optimize the synthesis filter. In one example, calculation efficiency 
is improved by approximately 87% to 99% without changing the quality of the 
encoded speech. 

BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS 

The invention, including its construction and method of operation, is 
illustrated more or less diagrammatically in the drawings, in which: 

Figure 1 is a block diagram of a speech analysis-by-synthesis system; 

Figure 2A is a flow chart of the speech synthesis system using model 
optimization only; 

Figure 2B is a flow chart of an alternative speech synthesis system 
using joint optimization of the model parameters and the excitation signal; 

Figure 3 is a flow chart of computations used in the efficient 
optimization algorithm; 

Figure 4 is a timeline-amplitude chart, comparing an original speech 
sample to a multipulse LPC synthesized speech and an optimally synthesized 
speech; 

Figure 5 is a chart, showing synthesis error reduction and improvement 
as a result of the optimization; and 

Figure 6 is a spectral chart, comparing the spectra of the original 
speech sample to an LPC synthesized speech and an optimally synthesized 
speech. 



DESCRIPTION 

Referring now to the drawings, and particularly to Figure 1, a speech 
coding system is provided that minimizes the synthesis error in order to more 
accurately model the original speech. In Figure 1, an analysis-by-synthesis 
("AbS") system is shown which is commonly referred to as a source-filter 
model. As is well known in the art, source-filter models are designed to 
mathematically model human speech production. Typically, the model 
assumes that the human sound-producing mechanisms that produce speech 
remain fixed, or unchanged, during successive short time intervals, or frames 
(e.g., 10 to 30 ms analysis frames). The model further assumes that the 
human sound producing mechanisms can change between successive 
intervals. The physical mechanisms modeled by this system include air 
pressure variations generated by the vocal folds, glottis, mouth, tongue, nasal 
cavities and lips. Thus, the speech decoder reproduces the model and 
recreates the original speech using only a small set of control data for each 
interval. Therefore, unlike conventional sound transmission systems, the raw 
sampled data of the original speech is not transmitted from the encoder to the 
decoder. As a result, the digitally encoded data that is actually transmitted or 
stored (i.e., the bandwidth, or the number of bits) is much less than that 
required by typical direct sampling systems. 

Accordingly, Figure 1 shows an original digitized speech 10 delivered 
to an excitation module 12. The excitation module 12 then analyzes each 
sample s(n) of the original speech and generates an excitation function u(n). 
The excitation function u(n) is typically a series of pulses that represent air 
bursts from the lungs which are released by the vocal folds to the vocal tract. 
Depending on the nature of the original speech sample s(n), the excitation 
function u(n) may be either a voiced 13, 14 or an unvoiced signal 15. 

One way to improve the quality of reproduced speech in speech coding 
systems involves improving the accuracy of the voiced excitation function 
u(n). Traditionally, the excitation function u(n) has been treated as a series of 
pulses 13 with a fixed magnitude G and period P between the pitch pulses. 
As those in the art well know, the magnitude G and period P may vary 
between successive intervals. In contrast to the traditional fixed magnitude G 
and period P, it has previously been shown in the art that speech synthesis 
can be improved by optimizing the excitation function u(n) by varying the 
magnitude and spacing of the excitation pulses 14. This improvement is 
described in Bishnu S. Atal and Joel R. Remde, A New Model of LPC 
Excitation For Producing Natural-Sounding Speech At Low Bit Rates, IEEE 
International Conference On Acoustics, Speech, And Signal Processing 614-17 
(1982). This optimization technique usually requires more intensive 
computing to encode the original speech s(n). However, in prior systems, this 
problem has not been a significant disadvantage since modern computers 
usually provide sufficient computing power for optimization 14 of the excitation 
function u(n). A greater problem with this improvement has been the 
additional bandwidth that is required to transmit data for the variable excitation 
pulses 14. One solution to this problem is a coding system that is described 
in Manfred R. Schroeder and Bishnu S. Atal, Code-Excited Linear Prediction 
(CELP): High-Quality Speech At Very Low Bit Rates, IEEE International 
Conference On Acoustics, Speech, And Signal Processing, 937-40 (1985). 
This solution involves categorizing a number of optimized excitation functions 
into a library of functions, or a codebook. The encoding excitation module 12 
will then select an optimized excitation function from the codebook that 
produces a synthesized speech that most closely matches the original speech 
s(n). Next, a code that identifies the optimum codebook entry is transmitted to 
the decoder. When the decoder receives the transmitted code, the decoder 
then accesses a corresponding codebook to reproduce the selected optimal 
excitation function u(n). 

The excitation module 12 can also generate an unvoiced 15 excitation 
function u(n). An unvoiced 15 excitation function u(n) is used when the 
speaker's vocal folds are open and turbulent air flow is produced through the 
vocal tract. Most excitation modules 12 model this state by generating an 
excitation function u(n) consisting of white noise 15 (i.e., a random signal) 
instead of pulses. 

In one example of a typical speech coding system, an analysis frame of 
10 ms may be used in conjunction with a sampling frequency of 8 kHz. Thus, 
in this example, 80 speech samples are taken and analyzed for each 10 ms 
frame. In standard linear predictive coding ("LPC") systems, the excitation 
module 12 usually produces one pulse for each analysis frame of voiced 
sound. By comparison, in code-excited linear prediction ("CELP") systems, 
the excitation module 12 will usually produce about ten pulses for each 
analysis frame of voiced speech. By further comparison, in mixed excitation 
linear prediction ("MELP") systems, the excitation module 12 generally 
produces one pulse for every speech sample, that is, eighty pulses per frame 
in the present example. 

Next, the synthesis filter 16 models the vocal tract and its effect on the 
air flow from the vocal folds. Typically, the synthesis filter 16 uses a 
polynomial equation to represent the various shapes of the vocal tract. This 
technique can be visualized by imagining a multiple section hollow tube with 
several different diameters along the length of the tube. Accordingly, the 
synthesis filter 16 alters the characteristics of the excitation function u(n) 
similar to the way the vocal tract alters the air flow from the vocal folds, or in 
other words, like the variable diameter hollow tube example alters inflowing 
air. 

According to Atal and Remde, supra, the synthesis filter 16 can be 
represented by the mathematical formula: 

$$H(z) = \frac{G}{A(z)} \tag{1}$$

where G is a gain term representing the loudness of the voice. A(z) is a 
polynomial of order M and can be represented by the formula: 

$$A(z) = 1 + \sum_{k=1}^{M} a_k z^{-k} \tag{2}$$

The order of the polynomial A(z) can vary depending on the particular 
application, but a 10th order polynomial is commonly used with an 8 kHz 
sampling rate. The relationship of the synthesized speech $\hat{s}(n)$ to the 
excitation function u(n) as determined by the synthesis filter 16 can be defined 
by the formula: 

$$\hat{s}(n) = G\,u(n) - \sum_{k=1}^{M} a_k\, \hat{s}(n-k) \tag{3}$$
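
As a minimal illustration of formula (3), the recursion can be sketched in Python. The function name `synthesize`, the zero filter memory before the start of the frame, and the toy first-order coefficient are assumptions of this sketch and are not taken from the description.

```python
def synthesize(u, a, gain=1.0):
    """Run formula (3): s_hat(n) = G*u(n) - sum_k a_k * s_hat(n-k)."""
    s_hat = []
    for n in range(len(u)):
        acc = gain * u[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:  # assumption: filter memory is zero before the frame
                acc -= a[k - 1] * s_hat[n - k]
        s_hat.append(acc)
    return s_hat


# Toy first-order filter A(z) = 1 - 0.5 z^-1 driven by a single unit pulse.
print(synthesize([1.0, 0.0, 0.0, 0.0], [-0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

With a single pulse as input, the output is simply the impulse response of the synthesis filter, which is the quantity h(n) used later in the description.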



Conventionally, the coefficients $a_1 \ldots a_M$ of this polynomial are 
computed using a technique known in the art as linear predictive coding 
("LPC"). LPC-based techniques compute the polynomial coefficients $a_1 \ldots a_M$ 
by minimizing the total prediction error $E_p$. Accordingly, the sample prediction 
error $e_p(n)$ is defined by the formula: 

$$e_p(n) = s(n) + \sum_{k=1}^{M} a_k\, s(n-k) \tag{4}$$

The total prediction error $E_p$ is then defined by the formula: 

$$E_p = \sum_{k=0}^{N-1} e_p^2(k) \tag{5}$$

where N is the length of the analysis frame expressed in number of samples. 
The polynomial coefficients $a_1 \ldots a_M$ can now be computed by minimizing the 
total prediction error $E_p$ using well known mathematical techniques. 
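
The description does not name a particular minimization technique. As one possibility, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion, a common way of minimizing a prediction error of this form. The function names, the choice of this method, and the assumption of a non-silent frame (non-zero `r[0]`) are all assumptions of the sketch.

```python
def autocorrelation(s, max_lag):
    """R(i) = sum_n s(n) * s(n-i) over the frame, for i = 0..max_lag."""
    return [sum(s[n] * s[n - i] for n in range(i, len(s)))
            for i in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_M (sign convention of formula (4)).

    Returns the coefficient list [a_1, ..., a_M] and the residual error energy.
    Assumes r[0] > 0, i.e. the frame is not silent.
    """
    a = [0.0] * (order + 1)   # a[0] is unused; a[k] holds a_k
    err = r[0]                # error energy of the order-0 predictor
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        refl = -acc / err     # reflection coefficient for order i
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + refl * prev[i - k]
        a[i] = refl
        err *= 1.0 - refl * refl
    return a[1:], err


# Autocorrelation values of a first-order process with correlation 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, residual)
```

For a 10th order model on one frame, a hypothetical call would be `levinson_durbin(autocorrelation(frame, 10), 10)`.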

One problem with the LPC technique of computing the polynomial 
coefficients $a_1 \ldots a_M$ is that only the total prediction error is minimized. Thus, 
the LPC technique does not minimize the error between the original speech 
s(n) and the synthesized speech $\hat{s}(n)$. Accordingly, the sample synthesis 
error $e_s(n)$ can be defined by the formula: 

$$e_s(n) = s(n) - \hat{s}(n) \tag{6}$$

The total synthesis error $E_s$ can then be defined by the formula: 

$$E_s = \sum_{n=0}^{N-1} e_s^2(n) = \sum_{n=0}^{N-1} \left(s(n) - \hat{s}(n)\right)^2 \tag{7}$$

where as before, N is the length of the analysis frame in number of samples. 
Like the total prediction error $E_p$ discussed above, the total synthesis error $E_s$ 
should be minimized to compute the optimum filter coefficients $a_1 \ldots a_M$. 
However, one difficulty with this technique is that the synthesized speech $\hat{s}(n)$, 
as represented in formula (3), makes the total synthesis error $E_s$ a highly 
nonlinear function that is not generally well-behaved mathematically. 
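
To make the distinction between the two error measures concrete, the following sketch evaluates formulas (4) and (5) and formulas (6) and (7) on toy data. The function names and the convention that samples before the frame are zero are assumptions of this sketch.

```python
def total_prediction_error(s, a):
    """E_p from formulas (4) and (5); samples before the frame are taken as zero."""
    total = 0.0
    for n in range(len(s)):
        e_p = s[n] + sum(a[k - 1] * s[n - k]
                         for k in range(1, len(a) + 1) if n - k >= 0)
        total += e_p * e_p
    return total


def total_synthesis_error(s, s_hat):
    """E_s from formulas (6) and (7): squared difference, summed over the frame."""
    return sum((x - y) ** 2 for x, y in zip(s, s_hat))


original = [1.0, 0.5, 0.25]
print(total_prediction_error(original, [-0.5]))          # depends only on s(n) and a_k
print(total_synthesis_error(original, [1.0, 0.5, 0.0]))  # depends on the synthesized output
```

The prediction error is computed from the original speech alone, whereas the synthesis error compares the original speech against the output of the synthesis filter.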

One solution to this mathematical difficulty is to minimize the total 
synthesis error $E_s$ using the roots of the polynomial A(z) instead of the 
coefficients $a_1 \ldots a_M$. Using roots instead of coefficients for optimization also 
provides control over the stability of the synthesis filter 16. Accordingly, 
assuming that h(n) is the impulse response of the synthesis filter 16, the 
synthesized speech $\hat{s}(n)$ is now defined by the formula: 

$$\hat{s}(n) = h(n) * u(n) = \sum_{k=0}^{n} h(k)\, u(n-k) \tag{8}$$

where * is the convolution operator. In this formula, it is also assumed that 
the excitation function u(n) is zero outside of the interval 0 to N-1. 

In LPC and multipulse encoders, the excitation function u(n) is 
relatively sparse. That is, non-zero pulses occur at only a few samples in the 
entire analysis frame, with most samples in the analysis frame having no 
pulses. For LPC encoders, as few as one pulse per frame may exist, while 
multipulse encoders may have as few as 10 pulses per frame. Accordingly, 
$N_p$ may be defined as the number of excitation pulses in the analysis frame, 
and p(k) may be defined as the pulse positions within the frame. Thus, the 
excitation function u(n) can be expressed by the formulas: 

$$u(p(k)) \neq 0 \quad \text{for } k = 1, 2, \ldots, N_p \tag{9a}$$
$$u(n) = 0 \quad \text{for } n \neq p(k) \tag{9b}$$

Hence, the excitation function u(n) for a given analysis frame includes $N_p$ 
pulses at locations defined by p(k) with the amplitudes defined by u(p(k)). 

By substituting formulas (9a) and (9b) into formula (8), the synthesized 
speech $\hat{s}(n)$ can now be expressed by the formula: 

$$\hat{s}(n) = h(n) * u(n) = \sum_{k=1}^{F(n)} h(n - p(k))\, u(p(k)) \tag{10}$$

where F(n) is the number of pulses up to and including sample n in the 
analysis frame. Accordingly, the function F(n) satisfies the following 
relationships: 

$$p(F(n)) \leq n \tag{11a}$$
$$F(n) \leq N_p \tag{11b}$$

This relationship for F(n) is preferred because it guarantees that (n-p(k)) will 
be non-negative. 
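
The equivalence of formulas (8) and (10) can be checked numerically. In the sketch below, the function names, the representation of the excitation as a sorted list of (position, amplitude) pairs, and the toy impulse response are assumptions made for illustration.

```python
def dense_synthesis(h, u):
    """Formula (8): full convolution sum for every sample of the frame."""
    return [sum(h[k] * u[n - k] for k in range(n + 1)) for n in range(len(u))]


def sparse_synthesis(h, pulses, frame_len):
    """Formula (10): visit only the pulses located at or before sample n.

    `pulses` is a list of (position, amplitude) pairs sorted by position.
    """
    s_hat = [0.0] * frame_len
    for n in range(frame_len):
        for position, amplitude in pulses:
            if position > n:      # pulses beyond F(n) do not contribute yet
                break
            s_hat[n] += h[n - position] * amplitude
    return s_hat


frame_len = 8
h = [0.5 ** n for n in range(frame_len)]   # toy impulse response
pulses = [(1, 2.0), (5, -1.0)]             # two non-zero excitation samples
u = [0.0] * frame_len
for position, amplitude in pulses:
    u[position] = amplitude

dense = dense_synthesis(h, u)
sparse = sparse_synthesis(h, pulses, frame_len)
print(all(abs(x - y) < 1e-12 for x, y in zip(dense, sparse)))  # True
```

Both routines produce the same samples; the sparse version skips every product in which u(n-k) is zero.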

From the foregoing, it can now be shown that formula (8) requires n 
multiplications and n additions in order to compute the synthesized speech at 
sample n. Accordingly, the total number of multiplications and additions $N_T$ 
that are required for a given frame of length N is given by the formula: 

$$N_T = N(N+1)/2 \tag{12}$$

Thus, the resulting number of computations required is given by a quadratic 
function defined by the length of the analysis frame. Therefore, in the 
aforementioned example, the total number $N_T$ of computations required by 
formula (8) may be as many as 3,240 (i.e., 80(80+1)/2) for a 10 ms frame. 

On the other hand, it can be shown that the maximum number $N'_T$ of 
computations required to compute the synthesized speech using formula (10) 
can be closely approximated by the formula: 

$$N'_T = N_p N \tag{13}$$

where $N_p$ is the total number of pulses in the frame. Formula (13) represents 
the maximum number of computations that may be required assuming that 
the pulses are nonuniformly distributed. If pulses are uniformly distributed in 
the analysis frame, the total number $N''_T$ of computations required by formula 
(10) is given by the formula: 

$$N''_T = N_p N / 2 \tag{14}$$

Therefore, using the aforementioned example again, the total number $N''_T$ of 
computations required by formula (10) may be as few as 400 (i.e., 10(80)/2) 
for an RPE (Regular Pulse Excitation) multipulse encoder. By comparison, 
formula (10) may require as few as 40 computations (i.e., 1(80)/2) for an LPC 
encoder. 

One advantage of the improved optimization algorithm can now be 
appreciated. The computation of the synthesized speech $\hat{s}(n)$ using the 
convolution of the impulse response h(n) and the excitation function u(n) 
requires far fewer calculations than previously required. Thus, whereas about 
3,240 computations were previously required, only 400 computations are now 
required for RPE multipulse encoders and only 40 computations for LPC 
encoders. This improvement results in about an 87% reduction in 
computational load for RPE encoders and about a 99% reduction for LPC 
encoders. 
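
The operation counts quoted above follow directly from formulas (12), (13) and (14). The short sketch below reproduces the arithmetic; the function names are assumptions of the sketch.

```python
def dense_ops(frame_len):
    """Formula (12): N(N+1)/2 operations for the full convolution."""
    return frame_len * (frame_len + 1) // 2


def sparse_ops_max(num_pulses, frame_len):
    """Formula (13): upper bound N_p * N for the sparse computation."""
    return num_pulses * frame_len


def sparse_ops_uniform(num_pulses, frame_len):
    """Formula (14): N_p * N / 2 when pulses are uniformly distributed."""
    return num_pulses * frame_len // 2


def reduction(num_pulses, frame_len):
    """Fraction of operations saved relative to formula (12)."""
    return 1.0 - sparse_ops_uniform(num_pulses, frame_len) / dense_ops(frame_len)


N = 80  # samples in a 10 ms frame at 8 kHz
print(dense_ops(N))                 # 3240
print(sparse_ops_uniform(10, N))    # 400 (ten pulses per frame)
print(sparse_ops_uniform(1, N))     # 40  (one pulse per frame)
print(reduction(10, N), reduction(1, N))
```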

Using the roots of A(z), the polynomial can now be expressed by the 
formula: 

$$A(z) = \prod_{i=1}^{M} (1 - \lambda_i z^{-1}) \tag{15}$$

where $\lambda_1 \ldots \lambda_M$ represent the roots of the polynomial A(z). These roots may 
be either real or complex. Thus, in the preferred 10th order polynomial, A(z) 
will have 10 different roots. 
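
The relationship between the product form of formula (15) and the coefficient form of formula (2) can be illustrated by expanding the product. The function names below are assumptions of this sketch. The same expansion is one way to convert optimized roots back into polynomial coefficients.

```python
def roots_to_coefficients(roots):
    """Expand prod_i (1 - lambda_i z^-1) into [1, a_1, ..., a_M]."""
    coeffs = [1.0]
    for lam in roots:
        expanded = coeffs + [0.0]
        for i, c in enumerate(coeffs):
            expanded[i + 1] -= lam * c   # multiply by (1 - lam * z^-1)
        coeffs = expanded
    return coeffs


def evaluate_A(coeffs, z):
    """Evaluate A(z) = sum_k coeffs[k] * z^-k at a non-zero point z."""
    return sum(c * z ** (-k) for k, c in enumerate(coeffs))


real_pair = roots_to_coefficients([0.5, -0.25])
print(real_pair)                      # [1.0, -0.25, -0.125]
print(abs(evaluate_A(real_pair, 0.5)))  # A(z) vanishes at each root

# A complex-conjugate pair yields coefficients with zero imaginary part.
print(roots_to_coefficients([0.9j, -0.9j]))
```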

Using parallel decomposition, the synthesis filter transfer function H(z) 
is now represented in terms of the roots by the formula: 

$$H(z) = \frac{1}{A(z)} = \sum_{i=1}^{M} \frac{b_i}{1 - \lambda_i z^{-1}} \tag{16}$$

(the gain term G is omitted from this and the remaining formulas for 
simplicity). The decomposition coefficients $b_i$ are then calculated by the 
residue method for polynomials, thus providing the formula: 

$$b_i = \prod_{j=1,\, j \neq i}^{M} \frac{1}{1 - \lambda_j \lambda_i^{-1}} \tag{17}$$

The impulse response h(n) can also be represented in terms of the roots by 
the formula: 

$$h(n) = \sum_{i=1}^{M} b_i (\lambda_i)^n \tag{18}$$

Next, by combining formula (18) with formula (8), the synthesized 
speech $\hat{s}(n)$ can be expressed by the formula: 

$$\hat{s}(n) = \sum_{k=0}^{n} h(k)\, u(n-k) = \sum_{k=0}^{n} u(n-k) \sum_{i=1}^{M} b_i (\lambda_i)^k \tag{19}$$



By substituting formulas (9a) and (9b) into formula (19), the synthesized 
speech $\hat{s}(n)$ can now be efficiently computed by the formula: 

$$\hat{s}(n) = \sum_{k=0}^{n} h(k)\, u(n-k) = \sum_{k=1}^{F(n)} u(p(k)) \sum_{i=1}^{M} b_i (\lambda_i)^{n - p(k)} \tag{20}$$

where F(n) is defined by the relationship in formula (11). As previously 
described, formula (20) is about 87% more efficient than formula (19) for 
multipulse encoders and is about 99% more efficient for LPC encoders. 
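
The root-based synthesis of formulas (17), (18) and (20) can be sketched as follows. The sketch assumes distinct, non-zero roots (so that the partial-fraction expansion exists) and represents the excitation as a sorted list of (position, amplitude) pairs. All function names are hypothetical.

```python
def decomposition_coeffs(roots):
    """Formula (17): b_i = prod_{j != i} 1 / (1 - lambda_j / lambda_i)."""
    coeffs = []
    for i, lam_i in enumerate(roots):
        prod = 1.0
        for j, lam_j in enumerate(roots):
            if j != i:
                prod *= 1.0 - lam_j / lam_i
        coeffs.append(1.0 / prod)
    return coeffs


def impulse_response(roots, b, n):
    """Formula (18): h(n) = sum_i b_i * lambda_i^n."""
    return sum(b_i * lam ** n for b_i, lam in zip(b, roots))


def synthesize_from_roots(roots, pulses, frame_len):
    """Formula (20): sum only over pulses located at or before sample n."""
    b = decomposition_coeffs(roots)
    s_hat = []
    for n in range(frame_len):
        total = 0.0
        for position, amplitude in pulses:
            if position > n:
                break
            total += amplitude * impulse_response(roots, b, n - position)
        s_hat.append(total)
    return s_hat


roots = [0.5, -0.25]   # corresponds to A(z) = 1 - 0.25 z^-1 - 0.125 z^-2
b = decomposition_coeffs(roots)
print(b)
print([impulse_response(roots, b, n) for n in range(3)])
print(synthesize_from_roots(roots, [(2, 3.0)], 5))
```

For this toy filter the impulse response starts 1, 0.25, 0.1875, which matches the direct recursion of formula (3) with a_1 = -0.25 and a_2 = -0.125.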

The total synthesis error $E_s$ can be minimized using polynomial roots 
and a gradient search algorithm by substituting formula (20) into formula (7). 
A number of optimization algorithms may be used to minimize the total 
synthesis error $E_s$. However, one possible algorithm is an iterative gradient 
search algorithm. Accordingly, denoting the root vector at the j-th iteration as 
$\Lambda^{(j)}$, the root vector can be expressed by the formula: 

$$\Lambda^{(j)} = \left[\lambda_1^{(j)} \ldots \lambda_r^{(j)} \ldots \lambda_M^{(j)}\right]^T \tag{21}$$

where $\lambda_r^{(j)}$ is the value of the r-th root at the j-th iteration and T is the 
transpose operator. The search begins with the LPC solution as the starting 
point, which is expressed by the formula: 

$$\Lambda^{(0)} = \left[\lambda_1^{(0)} \ldots \lambda_r^{(0)} \ldots \lambda_M^{(0)}\right]^T \tag{22}$$

To compute $\Lambda^{(0)}$, the LPC coefficients $a_1 \ldots a_M$ are converted to the 
corresponding roots $\lambda_1^{(0)} \ldots \lambda_M^{(0)}$ using a standard root finding algorithm. 

Next, the roots at subsequent iterations can be computed using the 
formula: 

$$\Lambda^{(j+1)} = \Lambda^{(j)} + \mu \nabla_j E_s \tag{23}$$

where $\mu$ is the step size and $\nabla_j E_s$ is the gradient of the synthesis error $E_s$ 
relative to the roots at iteration j. The step size $\mu$ can be either fixed for each 
iteration, or alternatively, it can be variable and adjusted for each iteration. 
Using formula (7), the synthesis error gradient vector $\nabla_j E_s$ can now be 
calculated by the formula: 

$$\nabla_j E_s = \sum_{k=1}^{N-1} \left(s(k) - \hat{s}(k)\right) \nabla_j \hat{s}(k) \tag{24}$$

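Formula (24) demonstrates that the synthesis error gradient vector $\nabla_j E_s$ 
can be calculated using the gradient vectors of the synthesized speech 
samples $\hat{s}(k)$. Accordingly, the synthesized speech gradient vector $\nabla_j \hat{s}(k)$ can 
be defined by the formula: 

$$\nabla_j \hat{s}(k) = \left[\partial \hat{s}(k)/\partial \lambda_1^{(j)} \ldots \partial \hat{s}(k)/\partial \lambda_r^{(j)} \ldots \partial \hat{s}(k)/\partial \lambda_M^{(j)}\right] \tag{25}$$

where $\partial \hat{s}(k)/\partial \lambda_r^{(j)}$ is the partial derivative of $\hat{s}(k)$ at iteration j with respect to the 
r-th root. Using formula (19), the partial derivatives $\partial \hat{s}(k)/\partial \lambda_r^{(j)}$ can be 
computed by the formula: 

$$\partial \hat{s}(k)/\partial \lambda_r^{(j)} = b_r \sum_{m=1}^{k} m\, u(k-m) \left(\lambda_r^{(j)}\right)^{m-1}, \quad k \geq 1 \tag{26}$$

where $\partial \hat{s}(0)/\partial \lambda_r^{(j)}$ is always zero. 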

By substituting formulas (9a) and (9b) into formula (26), the partial 
derivatives can now be expressed by the formula: 

$$\partial \hat{s}(k)/\partial \lambda_r^{(j)} = b_r \sum_{m=1}^{F(k)} (k - p(m))\, u(p(m)) \left(\lambda_r^{(j)}\right)^{k - p(m) - 1} \tag{27}$$

where F(k) is defined by the relationship in formula (11). Like formulas (10) 
and (20), the computation of formula (27) will require far fewer calculations 
compared to formula (26). 
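
The agreement between the dense partial derivative of formula (26) and the sparse form of formula (27) can be checked with a short sketch. The function names and toy values are assumptions. The decomposition coefficients are passed in as fixed values, reflecting that they are held constant within an iteration.

```python
def partials_dense(u, roots, b, k):
    """Formula (26): sum over every lag m = 1..k, including zero excitation samples."""
    return [b_r * sum(m * u[k - m] * lam ** (m - 1) for m in range(1, k + 1))
            for b_r, lam in zip(b, roots)]


def partials_sparse(pulses, roots, b, k):
    """Formula (27): sum only over pulses located before sample k.

    `pulses` is a list of (position, amplitude) pairs sorted by position.
    A pulse located exactly at k has weight (k - p) = 0 and is skipped.
    """
    result = []
    for b_r, lam in zip(b, roots):
        total = 0.0
        for position, amplitude in pulses:
            if position >= k:
                break
            lag = k - position
            total += lag * amplitude * lam ** (lag - 1)
        result.append(b_r * total)
    return result


roots = [0.5, -0.25]
b = [2.0 / 3.0, 1.0 / 3.0]           # decomposition coefficients for these roots
pulses = [(1, 2.0), (5, -1.0)]
u = [0.0] * 8
for position, amplitude in pulses:
    u[position] = amplitude

for k in range(8):
    dense = partials_dense(u, roots, b, k)
    sparse = partials_sparse(pulses, roots, b, k)
    assert all(abs(x - y) < 1e-12 for x, y in zip(dense, sparse))
print("formulas (26) and (27) agree on this frame")
```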

The synthesis error gradient vector $\nabla_j E_s$ is now calculated by 
substituting formula (27) into formula (25) and formula (25) into formula (24). 
The updated root vector $\Lambda^{(j+1)}$ at the next iteration can then be calculated by 
substituting the result of formula (24) into formula (23). After the root vector 
$\Lambda^{(j)}$ is recalculated, the decomposition coefficients $b_i$ are updated prior to the 
next iteration using formula (17). A detailed description of one algorithm for 
updating the decomposition coefficients is described in U.S. patent application 
number to Lashkari et al. (Attorney Docket No. 10745/20). The 
iterations of the gradient search algorithm are repeated until either the step 
size becomes smaller than a predefined value $\mu_{min}$, a predetermined number 
of iterations are completed, or the roots are resolved within a predetermined 
distance from the unit circle. 
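
One possible shape of the iteration described above is sketched below. It follows formulas (23), (24), (27) and (17) as written, but several parts are assumptions of the sketch rather than the described method: it handles real roots only, it requires distinct non-zero roots, it uses a step-halving rule (a trial step is accepted only if the synthesis error decreases) as one way of realizing a variable step size, and it rejects any trial root that comes within `margin` of the unit circle. All names and default values are hypothetical.

```python
def decomposition_coeffs(roots):
    """Formula (17); assumes distinct, non-zero roots."""
    coeffs = []
    for i, lam_i in enumerate(roots):
        prod = 1.0
        for j, lam_j in enumerate(roots):
            if j != i:
                prod *= 1.0 - lam_j / lam_i
        coeffs.append(1.0 / prod)
    return coeffs


def synthesize(roots, b, pulses, frame_len):
    """Formula (20): sparse synthesis from roots and (position, amplitude) pulses."""
    return [sum(amp * sum(b_i * lam ** (n - p) for b_i, lam in zip(b, roots))
                for p, amp in pulses if p <= n)
            for n in range(frame_len)]


def synthesis_error(s, s_hat):
    """Formula (7)."""
    return sum((x - y) ** 2 for x, y in zip(s, s_hat))


def search_direction(s, s_hat, roots, b, pulses):
    """Formula (24), with the sparse partial derivatives of formula (27)."""
    direction = [0.0] * len(roots)
    for k in range(1, len(s)):
        residual = s[k] - s_hat[k]
        for r, (b_r, lam) in enumerate(zip(b, roots)):
            partial = b_r * sum((k - p) * amp * lam ** (k - p - 1)
                                for p, amp in pulses if p < k)
            direction[r] += residual * partial
    return direction


def acceptable(roots, margin):
    """Assumed guard: distinct, non-zero roots kept inside the unit circle."""
    return (all(0.0 < abs(lam) < 1.0 - margin for lam in roots)
            and len(set(roots)) == len(roots))


def optimize_roots(s, roots, pulses, mu=0.1, mu_min=1e-6, max_iter=50, margin=0.01):
    frame_len = len(s)
    b = decomposition_coeffs(roots)
    s_hat = synthesize(roots, b, pulses, frame_len)
    best = synthesis_error(s, s_hat)
    for _ in range(max_iter):                # iteration limit
        if mu < mu_min:                      # step size below the threshold
            break
        step = search_direction(s, s_hat, roots, b, pulses)
        trial = [lam + mu * g for lam, g in zip(roots, step)]   # formula (23)
        if acceptable(trial, margin):
            trial_b = decomposition_coeffs(trial)               # update b_i
            trial_hat = synthesize(trial, trial_b, pulses, frame_len)
            trial_err = synthesis_error(s, trial_hat)
            if trial_err < best:
                roots, b, s_hat, best = trial, trial_b, trial_hat, trial_err
                continue
        mu /= 2.0                            # assumed rule: shrink a rejected step
    return roots, best
```

Because a step is kept only when it lowers the error, the returned error never exceeds the error of the starting roots. An implementation that supports complex roots would also need to keep conjugate pairs paired, which this sketch does not attempt.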

Although control data for the optimal synthesis polynomial A(z) can be 
transmitted in a number of different formats, it is preferable to convert the 
roots found by the optimization technique described above back into 
polynomial coefficients $a_1 \ldots a_M$. The conversion can be performed by well 
known mathematical techniques. This conversion allows the optimized 
synthesis polynomial A(z) to be transmitted in the same format as existing 
speech coding systems, thus promoting compatibility with current standards. 

Now that the synthesis model has been completely determined, the 
control data for the model is quantized into digital data for transmission or 
storage. Many different industry standards exist for quantization. However, in 
one example, the control data that is quantized includes ten synthesis filter 
coefficients $a_1 \ldots a_{10}$, one gain value G for the magnitude of the excitation 
pulses, one pitch period value P for the frequency of the excitation pulses, 
and one indicator for a voiced 13 or unvoiced 15 excitation function u(n). As 
is apparent, this example does not include an optimized excitation pulse 14, 
which could be included with some additional control data. Accordingly, the 
described example requires the transmission of thirteen different variables at 
the end of each speech frame. Commonly, in CELP encoders the control 
data are quantized into a total of 80 bits. Thus, according to this example, the 
synthesized speech $\hat{s}(n)$, including optimization, can be transmitted within a 
bandwidth of 8,000 bits/s (80 bits/frame ÷ 0.010 s/frame). 
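
The figures in this example can be reproduced with integer arithmetic. The constant names below are assumptions of the sketch; the values are those stated in the description.

```python
SAMPLE_RATE_HZ = 8000    # sampling frequency from the example
FRAME_MS = 10            # analysis frame length in milliseconds
BITS_PER_FRAME = 80      # quantized control data per frame

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
# ten filter coefficients + gain G + pitch period P + voiced/unvoiced indicator
parameters_per_frame = 10 + 1 + 1 + 1
bit_rate = BITS_PER_FRAME * 1000 // FRAME_MS

print(samples_per_frame)     # 80
print(parameters_per_frame)  # 13
print(bit_rate)              # 8000
```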

As shown in both Figures 1 and 2, the order of operations can be 
changed depending on the accuracy desired and the computing resources 
available. Thus, in the embodiment described above, the excitation function 
u(n) was first determined to be a preset series of pulses 13 for voiced speech 
or an unvoiced signal 15. Second, the synthesis filter polynomial A(z) was 
determined using conventional techniques, such as the LPC method. Third, 
the synthesis polynomial A(z) was optimized. 

In Figures 2A and 2B, a different encoding sequence is shown that is 
applicable to multipulse and CELP-type speech coders which should provide 
even more accurate synthesis. However, some additional computing power 
will be needed. In this sequence, the original digitized speech sample 30 is 
used to compute 32 the polynomial coefficients $a_1 \ldots a_M$ using the LPC 
technique described above or another comparable method. The polynomial 
coefficients $a_1 \ldots a_M$ are then used to find 36 the optimum excitation function 
u(n) from a codebook. Alternatively, an individual excitation function u(n) can 
be found 40 from the codebook for each frame. After selection of the 
excitation function u(n), the polynomial coefficients $a_1 \ldots a_M$ are then also 
optimized. To make optimization of the coefficients $a_1 \ldots a_M$ easier, the 
polynomial coefficients $a_1 \ldots a_M$ are first converted 34 to the roots of the 
polynomial A(z). A gradient search algorithm is then used to optimize 38, 42, 
44 the roots. Once the optimal roots are found, the roots are then 
converted 46 back to polynomial coefficients $a_1 \ldots a_M$ for compatibility with 
existing encoding-decoding systems. Lastly, the synthesis model and the 
index to the codebook entry are quantized 48 for transmission or storage. 

Additional encoding sequences are also possible for improving the 
accuracy of the synthesis model depending on the computing capacity 
available for encoding. Some of these alternative sequences are 
demonstrated in Figure 1 by dashed routing lines. For example, the excitation 
function u(n) can be reoptimized at various stages during encoding of the 
synthesis model. 

Figure 3 shows a sequence of computations that requires fewer 
calculations to optimize the synthesis polynomial A(z). The sequence shows 
the computations for one frame 50, which are repeated for each frame 62 of 
speech. First, the synthesized speech $\hat{s}(n)$ is computed for each sample in 
the frame using formula (10) 52. The computation of the synthesized speech 
is repeated until the last sample in the frame has been computed 54. The 
roots of the synthesis filter polynomial A(z) are then computed using a 
standard root finding algorithm 56. Next, the roots of the synthesis polynomial 
are optimized with an iterative gradient search algorithm using formulas (27), 
(25), (24) and (23) 58. The iterations are then repeated until a completion 
criterion is met, for example when an iteration limit is reached 60. 

It is now apparent to those skilled in the art that the efficient 
optimization algorithm significantly reduces the number of calculations 
required to optimize the synthesis filter polynomial A(z). Thus, the efficiency 
of the encoder is greatly improved. Using previous optimization algorithms, 
the computation of the synthesized speech $\hat{s}(n)$ for each sample was a 
computationally intensive task. However, the improved optimization algorithm 
reduces the computational load required to compute the synthesized speech 
$\hat{s}(n)$ by taking into account the sparse nature of the excitation pulses, thereby 
minimizing the number of calculations performed. 

Figures 4-6 show the results provided by the more efficient 
optimization algorithm. The figures show several different comparisons 
between a prior art multipulse LPC synthesis system and the optimized 
synthesis system. The speech sample used for this comparison is a segment 
of a voiced part of the nasal "m". As shown in the figures, another advantage 
of the improved optimization algorithm is that the quality of the speech 
synthesis optimization is unaffected by the reduced number of calculations. 
Accordingly, the optimized synthesis polynomial that is computed using the 
more efficient optimization algorithm is exactly the same as the optimized 
synthesis polynomial that would result without reducing the number of 
calculations. Thus, less expensive CPUs and DSPs may be used and battery 
life may be extended without sacrificing speech quality. 

In Figure 4, a timeline-amplitude chart of the original speech, a prior art 
multipulse LPC synthesized speech and the optimized synthesized speech is 
shown. As can be seen, the optimally synthesized speech matches the 
original speech much more closely than the LPC synthesized speech. 

In Figure 5, the reduction in the synthesis error is shown for successive 
iterations of the optimization algorithm. At the first iteration, the synthesis 
error equals the LPC synthesis error since the LPC coefficients serve as the 
starting point for the optimization. Thus, the improvement in the synthesis 
error is zero at the first iteration. Thereafter, the synthesis error generally 
decreases with each iteration. Notably, the synthesis error increases (and 
the improvement decreases) at iteration number three. This characteristic 
occurs when the updated roots overshoot the optimal roots. After 
overshooting the optimal roots, the search algorithm takes the overshoot into 
account in successive iterations, thereby resulting in further reductions in the 
synthesis error. In the example shown, the synthesis error can be seen to be 
reduced by 37% after six iterations. Thus, a significant improvement over the 
LPC synthesis error is possible with the optimization. 

Figure 6 shows a spectral chart of the original speech, the LPC 
synthesized speech and the optimally synthesized speech. The first spectral 
peak of the original speech can be seen in this chart at a frequency of about 
280 Hz. Accordingly, the optimized synthesized speech waveform matches 
the 280 Hz component of the original speech much better than the LPC 
synthesized speech waveform. 

While preferred embodiments of the invention have been described, it 
should be understood that the invention is not so limited, and modifications 
may be made without departing from the invention. The scope of the 
invention is defined by the appended claims, and all devices that come within 
the meaning of the claims, either literally or by equivalence, are intended to be 
embraced therein. 